library(magrittr)
library(tidyverse)
library(limma)
library(edgeR)
library(AnnotationHub)
library(fgsea)
library(msigdbr)
library(goseq)
library(scales)
library(DT)
library(ggrepel)
library(ggeasy)
library(cqn)
library(ggfortify)
library(cowplot)
library(pathview)
library(pander)
library(ggpubr)
library(viridis)
library(pheatmap)
library(fgsea)
library(biomaRt)
library(Gviz)
library(BSgenome.Drerio.UCSC.danRer11)

theme_set(theme_bw())

Introduction

This analysis will be doing a differential gene expression and differential pathway analysis from fmr1 knockout larvae at 2dpf. Previously, Stephen Pederson and Melanie Hand did a DE analysis using limma. Here, I will do one with the generalised linear model (GLM) capablities of edgeR as this, in my experience, detects more DE genes. I then will do some pathway analysis on DE genes (if we get enough), and also the entire dataset using ranked lists.

Data description

This was taken from FMR1_analysis.Rmd

Prior to count-level analysis, the initial dataset was pre-processed using the following steps:

  • Adapters were removed from any reads derived from RNA fragments < 300bp
  • Bases were removed from the end of reads when the quality score dipped below 20
  • Reads < 35bp after trimming were discarded

After trimming alignment was performed using STAR v2.5.3a to the Danio rerio genome included in Ensembl Release 94 (GRCz11). Aligned reads were counted for each gene if the following criteria were satisfied:

  • Alignments were unique
  • Alignments strictly overlapped exonic regions

Counts were then loaded and set up as a DGEList. During loading genes were only retained if receiving more than one counted alignment in at least 4 samples.

Make annotation object and read in gene counts

ah <- AnnotationHub() %>%
  subset(species == "Danio rerio") %>%
  subset(dataprovider == "Ensembl") %>%
  subset(rdataclass == "EnsDb")
ensDb <- AnnotationHub()[["AH64906"]]
genes <- genes(ensDb)
x <- here::here("2_alignedData/featureCounts/genes.out") %>%
    read_tsv() %>%
    as.data.frame() %>%
    column_to_rownames("Geneid") %>%
    set_colnames(basename(colnames(.))) %>%
    set_colnames(str_remove(colnames(.), "_Aligned.+")) %>%
    magrittr::extract(rowSums(cpm(.) > 1) >= 4,) %>%
    DGEList(
        samples = tibble(
            sample = colnames(.),
            group = case_when(
                    sample %in% paste0("S", 1:8) ~ "FMR1",
                    !sample %in% paste0("S", 1:8)  ~ "Wildtype"
                  )) %>%
            mutate(group = factor(group, levels = c("Wildtype", "FMR1"))) %>%
            as.data.frame(),
        genes = genes[rownames(.)] %>%
            as.data.frame() %>%
            mutate(location = paste0(seqnames, ":", start, "-", end, ":", strand)) %>%
            dplyr::select(
                ensembl_gene_id = gene_id,
                location,
                description, 
                external_gene_name = gene_name
            )
    ) %>%
    calcNormFactors(method = "TMM") %>%
    estimateDisp()
## Design matrix not provided. Switch to the classic mode.

PCA analysis:

The overall similarity between samples can be explored by principal component analysis (PCA). PCA is a dimension reducing technique which uses linear combinations of the original data (i.e. the logCPMs of the genes in the RNA-seq dataset) to define a new set of variables (principal components).

Samples which have similar gene expression profiles will cluster together in a plot of principal components (PC) PC1 vs PC2. In the graph below, we see some seperation of samples by genotype across PC1. This means that fmr1 genotype is the largest source of variation and explains ~35% of the variance. Sample S5 doesnt group with the rest of the fmr1 samples, so it is likely an outlier. I will do a DE and pathway analysis including it.

pca_logCPM <- x$counts %>% 
    cpm(log = TRUE) %>% 
    t() %>% 
    prcomp()
pca_logCPM %>%
    autoplot(data = x$samples, colour = "group", size = 4) +
    geom_text_repel(
        aes(colour = group, label = sample), 
        show.legend = FALSE
    ) +
    labs(colour = "Genotype")
*PCA analysis using logCPM values after filtering of undectable genes. Sample S5 does appear most similar to the wiltype samples, however the genotype has been checked via expression of fmr1*

PCA analysis using logCPM values after filtering of undectable genes. Sample S5 does appear most similar to the wiltype samples, however the genotype has been checked via expression of fmr1

DE Analysis

I will use the generalised linear model functionality of the package edgeR to perform the DE analysis. edgeR uses a negative binomial variance function and estimates dispersions using the Cox-Reid profile-adjusted likelihood (CR) method. Tests for differential expression were perforned using likelihood ratio tests.

Design Matrix

The design matrix specifies which samples contain which characteristics. Here, WT was set as the intercept (or baseline) and the effect of FMR1 KO was set as the slope coefficient.

design <- model.matrix(~group, data = x$samples)
colnames(design) <- str_remove(colnames(design), "group")

Model Fit

Dispersions were already esitmated when generating the dge object x. This code below fits the negative binomial model.

glmFit <- glmFit(x, design)

Likelihood ratio tests

This code below tests for differential expression. Genes were considered DE if the FDR adjusted p-value was < 0.05.

topTable_glm <- glmFit %>% 
    glmLRT(coef = "FMR1") %>%
    topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>% 
    .$table %>% 
    as_tibble() %>% 
    mutate(DE = FDR < 0.05)

MD plot

Mean difference plots (MD)-plots show the average expression vs logFC. The DE genes had low/mild average expression (logCPM between ~0.3 and 6.7). Importantly, there appears to be a small bias in this dataset. This may impact GSEA analysis as genesets with low-medium expressed genes will appear as enriched for being upregulated.

md1 <- topTable_glm %>% 
    ggplot(aes(x = logCPM, y = logFC)) +
    geom_point(aes(colour = DE), alpha = 0.5) +
    geom_smooth(se = FALSE) +
    scale_color_manual(values = c("grey50", "red")) +
    geom_label_repel(
        aes(label = external_gene_name, colour = DE), 
        data = . %>% dplyr::filter(DE == TRUE), 
        show.legend = FALSE
    ) 
md1

CQN

In response to the observed bias CQN was used to rectify this issue as it may be derived from either a length or GC artefact. GC and length information is easily avaaible from EnsDb objects from release 98 onwards, and this release was used. No expresed genes were noted as absent from the annotations for Release 98, despite the differences between releases. GC content and Length were taken as weighted averages and simple averages respectively.

ens98 <- ah[["AH74989"]]
ex <- exonsBy(ens98, "tx")
tr <- transcripts(ens98)
tr$len <- ex %>%
    width %>%
    lapply(sum) %>%
    .[names(tr)] %>%
    unlist()
gnBias <- mcols(tr) %>%
    as.data.frame() %>%
    group_by(gene_id) %>%
    summarise(
        n = n(),
        meanGC = sum(len*gc_content) / sum(len),
        len = mean(len)
    )
mcols(genes) %<>%
    as.data.frame() %>%
    left_join(gnBias) %>%
    DataFrame()
gcCqn <- cqn(
    counts = x$counts,
    x = enframe(rownames(x), name = c()) %>%
        dplyr::rename(gene_id = value) %>% 
        left_join(gnBias) %>%
        .[["meanGC"]],
    lengths = enframe(rownames(x), name = c()) %>%
        dplyr::rename(gene_id = value) %>% 
        left_join(gnBias) %>%
        .[["len"]],
    sizeFactors = x$samples$lib.size
)
x$offset <- gcCqn$glm.offset
par(mfrow = c(1, 2))
cqnplot(gcCqn, n = 1, xlab = "GC Content")
cqnplot(gcCqn, n = 2, xlab = "Length")
*Model fits for GC content and gene length under the CQN model. Variability is clearly visible.*

Model fits for GC content and gene length under the CQN model. Variability is clearly visible.

par(mfrow = c(1, 1))
logCPM <- gcCqn$y + gcCqn$offset
logCPM %>%
    t() %>%
    prcomp() %>%
    autoplot(data = x$samples, colour = "group", size = 4) +
    geom_text_repel(
        aes(label = sample, colour = group),
        show.legend = FALSE
    ) +
    labs(colour = "Genotype")
*PCA after CQN showing similar patterns of separation as prior to this normalisation.*

PCA after CQN showing similar patterns of separation as prior to this normalisation.

Likelihood ratio tests

This code below tests for differential expression using Likelihood ratio tests. Genes were considered DE if the FDR adjusted p-value was < 0.05.

glmCQN <- glmFit(x, design) 
topTableCQN <- glmCQN %>% 
    glmLRT(coef = "FMR1") %>%
    topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>% 
    .$table %>% 
    as_tibble() %>% 
    mutate(DE = FDR < 0.05)

Comparison of MD Plots

md2 <- topTableCQN %>% 
    ggplot(aes(x = logCPM, y = logFC)) +
    geom_point(aes(colour = DE), alpha = 0.5) +
    geom_smooth(se = FALSE) +
    scale_color_manual(values = c("grey50", "red")) +
    geom_label_repel(
        aes(label = external_gene_name, colour = DE), 
        data = . %>% dplyr::filter(DE == TRUE), 
        show.legend = FALSE
    )
plot_grid(
    md1 + 
        coord_cartesian(ylim = c(-3, 3)) + 
        theme(legend.position = "none"),
    md2 + 
        coord_cartesian(ylim = c(-3, 3)) + 
        theme(
            legend.position = c(1, 0), 
            legend.justification = c(1, 0),
            legend.background = element_rect(colour = "grey30", size = 1/4)
        ),
    nrow = 1,
    labels = c("A", "B")
)
*Comparison of MD plots before and after CQN. The bias evident in A) the pre-CQN plots is no longer present in B) the post-CQN plot, suggesting GC and length bias were contributing factors.*

Comparison of MD plots before and after CQN. The bias evident in A) the pre-CQN plots is no longer present in B) the post-CQN plot, suggesting GC and length bias were contributing factors.

Volcano plot

topTableCQN %>% 
    ggplot(
        aes(x = logFC, y = -log10(PValue), colour = DE)
    ) +
    geom_point(alpha = 0.5) +
    scale_color_manual(values = c("grey50", "red")) +
    geom_label_repel(
        aes(label = external_gene_name), 
        data = . %>% dplyr::filter(DE == TRUE), 
        show.legend = FALSE
    ) +
    theme_bw()

A total of 22 genes were considered as significant as given in the table below.

topTableCQN %>% 
    dplyr::filter(DE) %>%
    dplyr::select(
        ID = ensembl_gene_id,
        Gene = external_gene_name, 
        starts_with("log"),
        PValue, FDR, 
        Location = location, 
        Description = description
    ) %>%
    datatable(
        caption = "The most highly ranked genes for differential expression.",
        rownames = FALSE,
        options = list(
            pageLength = 25
        )
    ) %>%
    formatRound(
        columns = c("logFC", "logCPM"), 
        digits = 2
    ) %>%
    formatSignif(
        columns = c("PValue", "FDR"),
        digits = 3
    )
topTable_glm %>%
    write.csv(here::here("results/DE_results_fmr1KO_KB.csv"))

Visualisation of the DE genes

Boxplots of Expression Values

LogCPMs of the significantly DE genes are shown in the plot below.

logCPM %>%
    .[dplyr::filter(topTableCQN, DE == TRUE)$ensembl_gene_id,] %>% 
    as.data.frame() %>% 
    rownames_to_column("ensembl_gene_id") %>% 
    gather(key = "sample", value = "logCPM", colnames(x$counts)) %>% 
    left_join(x$samples, by = "sample") %>% 
    left_join(x$genes) %>% 
    ggplot(aes(x = group, y = logCPM, fill = group)) +
    geom_boxplot(alpha = 0.5) + 
    geom_jitter(width = 0.1, colour = "grey30") +
    facet_wrap(~external_gene_name, scales = "free_y", ncol = 4) +
    theme_bw() +
    theme(legend.position = "none") 
*Boxplots of all DE genes with points overlaid*

Boxplots of all DE genes with points overlaid

Enrichment Testing

pwf <- topTableCQN %>%
    left_join(gnBias, by = c("ensembl_gene_id" = "gene_id")) %>%
    with(
        nullp(
            DEgenes = structure(DE, names = ensembl_gene_id),
            bias.data = meanGC
        )
    )

Defining Gene Sets

ens2Entrez <- genes[rownames(x)] %>%
    mcols() %>%
    as.data.frame() %>%
    as_tibble() %>%
    dplyr::select(
        gene_id, 
        entrez_gene = entrezid
    ) %>%
    unchop(entrez_gene) %>%
    dplyr::filter(!is.na(entrez_gene)) 
geneByPos <- x$genes %>% 
    as_tibble() %>%
    separate(location, into = c("chr", "range", "strand"), sep = ":") %>% 
    split(f = .$ensembl_gene_id) %>% 
    lapply(extract2, "chr")
posByChr <- x$genes %>% 
    as_tibble() %>%
    separate(location, into = c("chr", "range", "strand"), sep = ":") %>% 
    split(f = .$chr) %>% 
    lapply(extract2, "ensembl_gene_id")
hm <- msigdbr("Danio rerio", category = "H") %>%
    inner_join(ens2Entrez) %>%
    distinct(gs_name, gene_id, .keep_all = TRUE)
hmByGene <- hm %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
hmByID <- hm %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")
kg <- msigdbr("Danio rerio", category = "C2", subcategory = "CP:KEGG") %>%
    inner_join(ens2Entrez) %>%
    distinct(gs_name, gene_id, .keep_all = TRUE)
kgByGene <- kg %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
kgByID <- kg %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")

Testing Within the Set of DE Genes

Chromosomal Position

goseq(pwf, gene2cat = geneByPos) %>% 
    dplyr::filter(numDEInCat > 0) %>%
    as_tibble() %>%
    mutate(
        adjP = p.adjust(over_represented_pvalue, "bonferroni"),
        Expected = sum(topTableCQN$DE) * numInCat / nrow(x)
    ) %>%
    dplyr::select(
        chromosome = category, Expected, Observed = numDEInCat, N = numInCat, PValue = over_represented_pvalue, adjP
    ) %>%
    pander(
        caption = "Results for enrichment testing within the DE set of genes, based on chromosomal location."
    )

Using manually entered categories. Calculating the p-values…

Results for enrichment testing within the DE set of genes, based on chromosomal location.
chromosome Expected Observed N PValue adjP
14 0.7991 12 664 0 0
7 1.134 3 942 0.1189 0.9513
6 0.9869 2 820 0.2752 1
10 0.7859 1 653 0.5314 1
22 0.757 1 629 0.5699 1
17 0.8461 1 703 0.57 1
19 0.8677 1 721 0.5838 1
4 0.8364 1 695 0.5869 1

Hallmark GeneSets

goseq(pwf, gene2cat = hmByGene) %>% 
    dplyr::filter(numDEInCat > 0) %>%
    as_tibble() %>%
    mutate(
        adjP = p.adjust(over_represented_pvalue, "bonferroni"),
        Expected = sum(topTableCQN$DE) * numInCat / nrow(x)
    ) %>%
    dplyr::select(
        Category = category, Expected, Observed = numDEInCat, `GS Size` = numInCat, PValue = over_represented_pvalue, adjP
    ) %>%
    pander(
        caption = "Results for enrichment testing for Hallmark Gene Sets within the DE set of genes",
        split.tables = Inf,
        justify = "lrrrrr"
    )

Using manually entered categories. For 14633 genes, we could not find any categories. These genes will be excluded. To force their use, please run with use_genes_without_cat=TRUE (see documentation). This was the default behavior for version 1.15.1 and earlier. Calculating the p-values…

Results for enrichment testing for Hallmark Gene Sets within the DE set of genes
Category Expected Observed GS Size PValue adjP
HALLMARK_ESTROGEN_RESPONSE_EARLY 0.207 2 172 0.01198 0.08385
HALLMARK_PROTEIN_SECRETION 0.1119 1 93 0.1011 0.7079
HALLMARK_INFLAMMATORY_RESPONSE 0.1396 1 116 0.1269 0.8886
HALLMARK_COMPLEMENT 0.1889 1 157 0.1678 1
HALLMARK_ESTROGEN_RESPONSE_LATE 0.1974 1 164 0.1775 1
HALLMARK_P53_PATHWAY 0.219 1 182 0.1871 1
HALLMARK_MTORC1_SIGNALING 0.2383 1 198 0.206 1

KEGG Pathways

goseq(pwf, gene2cat = kgByGene) %>% 
    dplyr::filter(numDEInCat > 0) %>%
    as_tibble() %>%
    mutate(
        adjP = p.adjust(over_represented_pvalue, "bonferroni"),
        Expected = sum(topTableCQN$DE) * numInCat / nrow(x)
    ) %>%
    dplyr::select(
        Category = category, Expected, Observed = numDEInCat, `GS Size` = numInCat, PValue = over_represented_pvalue, adjP
    ) %>%
    pander(
        caption = "Results for enrichment testing for KEGG Gene Sets within the DE set of genes",
        split.tables = Inf,
        justify = "lrrrrr"
    )

Using manually entered categories. For 14556 genes, we could not find any categories. These genes will be excluded. To force their use, please run with use_genes_without_cat=TRUE (see documentation). This was the default behavior for version 1.15.1 and earlier. Calculating the p-values…

Results for enrichment testing for KEGG Gene Sets within the DE set of genes
Category Expected Observed GS Size PValue adjP
KEGG_LYSOSOME 0.1372 2 114 0.003041 0.02433
KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_GLOBO_SERIES 0.008425 1 7 0.005353 0.04282
KEGG_GALACTOSE_METABOLISM 0.02407 1 20 0.01409 0.1128
KEGG_SPHINGOLIPID_METABOLISM 0.04092 1 34 0.02638 0.2111
KEGG_DRUG_METABOLISM_CYTOCHROME_P450 0.03611 1 30 0.02822 0.2258
KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450 0.03731 1 31 0.0284 0.2272
KEGG_GLYCEROLIPID_METABOLISM 0.04453 1 37 0.02907 0.2325
KEGG_GLUTATHIONE_METABOLISM 0.04934 1 41 0.03899 0.3119

Since only 14 genes were found as DE, I will perform a GSEA on ranked lists (rather than overrepresentation in the DE genes) to see if any groups of genes within pre-defined gene sets are showing shifts towards up or downregulation as a group. I will look at the KEGG and HALLMARK gene sets available from MSigDB (https://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp). These gene sets have human gene entrez identifiers. So I will convert these human entrez ids to zebrafish ensembl ids using a mapping file downloaded from bioMart (https://m.ensembl.org/biomart). Some genes in the KEGG pathway gene sets did not contain a zebrafish orthologue. Therefore, the gene sets sometimes contained < 10 genes. I filtered out any KEGG pathway gene sets with less than 10 genes as they would not be particularly informative.

Testing Using Ranked Lists

FRY

fry from the limma package approximates the ROAST method. ROAST uses residual space rotation as a sort of continuous version of sample permutation. Like permutation tests, it protects against false positives caused by correlations between genes in the set (Wu et al, 2010). Using this method, I did not find any significantly altered gene sets in a directional context (FDR column), or mixed (so no directionality is taken into account (FDR.mixed column)).

fryRes <- logCPM %>%
    fry(
        index = c(hmByID, kgByID, posByChr), 
        design = design, 
        contrast = "FMR1", 
        sort = "mixed"
    ) %>% 
    rownames_to_column("geneset") %>% 
    as_tibble() 
    
fryRes %>%
    arrange(PValue.Mixed) %>% 
    mutate(FDR.Mixed = p.adjust(PValue.Mixed, "fdr")) %>%
    dplyr::select(geneset, NGenes, PValue = PValue.Mixed, FDR = FDR.Mixed) %>%
    dplyr::slice(1:10) %>%
    pander(
        caption = paste(
            "The", nrow(.), "most highly ranked gene sets across Chromosomal, Hallmark and KEGG testing using Fry.",
            "Tests were non-directional (Mixed)."
        ),
        split.tables = Inf,
        justify = "lrrr"
    )
The 10 most highly ranked gene sets across Chromosomal, Hallmark and KEGG testing using Fry. Tests were non-directional (Mixed).
geneset NGenes PValue FDR
14 664 0.0006031 0.158
KEGG_HISTIDINE_METABOLISM 23 0.004863 0.3537
HALLMARK_MYC_TARGETS_V2 57 0.005539 0.3537
KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION 44 0.006401 0.3537
KEGG_RNA_POLYMERASE 26 0.007102 0.3537
KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_GLOBO_SERIES 7 0.008145 0.3537
KEGG_GALACTOSE_METABOLISM 20 0.009449 0.3537
KEGG_GLYCEROLIPID_METABOLISM 37 0.01628 0.3867
KEGG_RNA_DEGRADATION 54 0.01818 0.3867
KEGG_GLYCOSYLPHOSPHATIDYLINOSITOL_GPI_ANCHOR_BIOSYNTHESIS 22 0.01909 0.3867

Enrichment testing on chromosome 14 genes.

Fry showed that genes on chromosome 14 tend to be differentially expressed. I want to determine whether there may be some biological reason why this is oberved. Therefore, I will test whether the top most altered genes from chromosome 14 are enriched in any KEGG or HALLMARK gene sets,

I will use fgsea, the fast implementation of the GSEA algorithm, using the chromosomal position gene sets on an undirectional ranked list (i.e. most DE -> to least DE). The leading edge fromt fgsea will give the genes which are the most dysregulated from chromosome 14.

I then will perform an over-representation analysis (using the KEGG and HALLMARK gene sets) on the leading edge genes to determine whether they are have a particular biological function.

# make the ranks
ranks <- topTableCQN %>% 
    mutate(rank = log10(1/PValue)) %>% 
    arrange(rank) %>% 
    dplyr::select(c("ensembl_gene_id", "rank")) %>% #only want the Pvalue with sign
    with(structure(rank, names = ensembl_gene_id)) %>% 
    rev()
png("rankedlist.png", width = 10, height = 10, units = "cm", res = 600)
plot(ranks, 
     xlab = "Genes", ylab = "-log10(PValue)")
dev.off()
## quartz_off_screen 
##                 2
# perform fgsea on all the chr gene sets. 
gsea_chr <- fgseaSimple(pathways =  posByChr, 
      stats = ranks, 
       nperm = 1e5
) %>% 
    as_tibble()

gsea_chr %>% 
    arrange(pval) %>% 
    mutate(padj = p.adjust(pval, "bonferroni")) %>% 
    dplyr::filter(padj < 0.05) %>% 
    dplyr::select(-leadingEdge) %>% 
    pander(
        caption = "Results for enrichment testing using fgsea based on chr position."
    )
Results for enrichment testing using fgsea based on chr position.
pathway pval padj ES NES nMoreExtreme size
14 1e-05 0.00026 0.5325 1.38 0 664
22 3e-05 0.00078 0.4641 1.202 2 629
# plot the enrichment
png("chr14_enrichment.png", width = 10, height = 10, units = "cm", res = 600)
plotEnrichment(pathway = posByChr$`14`,
               stats = ranks)
dev.off()
## quartz_off_screen 
##                 2
png("chr22_enrichment.png", width = 10, height = 10, units = "cm", res = 600)
plotEnrichment(pathway = posByChr$`22`,
               stats = ranks)
dev.off()
## quartz_off_screen 
##                 2

Next, I will take the leading edge for chromosome 14 (i.e. the genes in the ranked list from chromosome 14 which are before the peak on the enrichment plot above) and determine whether they are over-represented in any KEGG gene sets using the kegga function from the limma package.

First, we perfomed over-representation analysis using kegga() of the leading edge subset against all detectable genes in the RNA-seq experiment. The KEGG gene sets for RNA polymerase and NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION were found to be signifcatly enriched (by FDR).

kegga(de = gsea_chr %>% dplyr::filter(pathway == "14") %>% 
        .$leadingEdge %>% .[[1]], 
      geneid = rownames(x$counts), 
      universe = rownames(x$counts),
      gene.pathway = kgByID %>%
        unlist %>%
        as.data.frame() %>%
        rownames_to_column("pathway") %>%
        set_colnames(c("pathway", "gene_id")) %>%
        mutate(pathway = pathway %>% str_remove_all("[0-9]+$")) %>%
        dplyr::select(gene_id, pathway)
) %>% 
    topKEGG() %>% 
    as.data.frame() %>% 
    rownames_to_column("pathway") %>% 
    mutate(FDR = p.adjust(P.DE, "fdr"),
           pbonf = p.adjust(P.DE, "bonferroni")) %>% 
    dplyr::select(pathway, raw.p = P.DE, FDR, pbonf, `No. in leadingedge` = DE, `No. in geneset` = N) %>%
    head(10) %>% 
    pander(caption = "Results of over-representation of genes in the leading edge of chromosome 14 against all genes detetcted in RNA-seq experiment")
Results of over-representation of genes in the leading edge of chromosome 14 against all genes detetcted in RNA-seq experiment (continued below)
pathway raw.p FDR
KEGG_RNA_POLYMERASE 0.003029 0.04454
KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION 0.004454 0.04454
KEGG_AMYOTROPHIC_LATERAL_SCLEROSIS_ALS 0.01777 0.09158
KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_HEPARAN_SULFATE 0.02737 0.09158
KEGG_LYSOSOME 0.04011 0.09158
KEGG_APOPTOSIS 0.04143 0.09158
KEGG_LONG_TERM_POTENTIATION 0.04143 0.09158
KEGG_DRUG_METABOLISM_CYTOCHROME_P 0.04472 0.09158
KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P 0.04746 0.09158
KEGG_TYROSINE_METABOLISM 0.05311 0.09158
pbonf No. in leadingedge No. in geneset
0.06058 3 26
0.08908 6 136
0.3554 3 49
0.5475 2 23
0.8022 4 114
0.8286 3 68
0.8286 3 68
0.8944 2 30
0.9491 2 31
1 2 33

As an alternate viewpoint, an over-representation analysis was performed on the leading edge genes from chr14 relative to all genes on chromosome 14. No significant gene sets were observed.

kegga(de = gsea_chr %>% dplyr::filter(pathway == "14") %>% 
        .$leadingEdge %>% .[[1]], 
      geneid = rownames(x$counts), 
      universe = x$genes %>% 
        dplyr::filter(grepl(x = location, pattern = "^14:")) %>% 
        .$ensembl_gene_id,
      gene.pathway = kgByID %>%
        unlist %>%
        as.data.frame() %>%
        rownames_to_column("pathway") %>%
        set_colnames(c("pathway", "gene_id")) %>%
        mutate(pathway = pathway %>% str_remove_all("[0-9]+$")) %>%
        dplyr::select(gene_id, pathway)
) %>% 
    topKEGG() %>% 
    as.data.frame() %>% 
    rownames_to_column("pathway") %>% 
    mutate(FDR = p.adjust(P.DE, "fdr"),
           pbonf = p.adjust(P.DE, "bonferroni")) %>% 
    dplyr::select(pathway, raw.p = P.DE, FDR, pbonf, `No. in leadingedge` = DE, `No. in geneset` = N) %>%
    head(10) %>% 
    pander(caption = "Results of over-representation of genes in the leading edge of chromosome 14 against all genes on chr 14")
Results of over-representation of genes in the leading edge of chromosome 14 against all genes on chr 14 (continued below)
pathway raw.p FDR pbonf
KEGG_AMYOTROPHIC_LATERAL_SCLEROSIS_ALS 0.02956 0.1599 0.5912
KEGG_LONG_TERM_POTENTIATION 0.02956 0.1599 0.5912
KEGG_ALZHEIMERS_DISEASE 0.03428 0.1599 0.6856
KEGG_FOCAL_ADHESION 0.07797 0.1599 1
KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION 0.0886 0.1599 1
KEGG_APOPTOSIS 0.09101 0.1599 1
KEGG_RNA_POLYMERASE 0.09101 0.1599 1
KEGG_DNA_REPLICATION 0.09593 0.1599 1
KEGG_MTOR_SIGNALING_PATHWAY 0.09593 0.1599 1
KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY 0.09593 0.1599 1
No. in leadingedge No. in geneset
3 3
3 3
4 5
4 6
6 11
3 4
3 4
2 2
2 2
2 2

check for nacre contamination

It has come to our attention that some of the samples (i.e. pools of larvae) may have arisen from fish heterozygous for a nacre (w2) allele of the gene mifta, as well as the fmr1 gene.

In each pool (sample), the larvae which showed a nacre phenotype (i.e. no pigment cells) were removed from the analysis. However, these would be larvae which are homozygous for the nacre allele. Heterozygous nacre larvae could also be present in this sample.

To investigate to what extent there are contaminiating nacre larvae in our RNA-seq samples, I will look at the number of reads which align to the nacre site on chromosome 6.

# commenting out this section so I dont need to mount phoenix each time
# the bam files are on th mounted drive on phoenix. 
# bams <- list.files("/fast/users/a1211024/fmr1/03_alignedData/bams", pattern = ".bam")

# turn this off as we aligned to ensembl genome
# options(ucscChromosomeNames=FALSE)
# 
# #make the reference sequence with the chromosome name as 6, not chr6
# temp = DNAStringSet(c("6"=paste(as.vector(Drerio$chr6), length(Drerio$chr6), TRUE),collapse=""))
# 
# atrack_s2 <- AlignmentsTrack(
#   "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873955_GSM4578112_fmr1_1_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam", 
#   isPaired = TRUE, chromosome = "6", genome = "danRer11", 
#   name = "Alignment", max.height = 10, 
#   referenceSequence = SequenceTrack(temp, 
#                                     name = "Ref", chromosome = "6"), 
#   size = 10)
# 
# png("gviz/sample_s2.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_s2), chromosome = "6",
#          from = 43429185, to = 43429186, main = "S2", sizes = 1)
# dev.off()
# 
# 
# atrack_s4 <- AlignmentsTrack(
#   "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873956_GSM4578113_fmr1_2_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam", 
#   isPaired = TRUE, chromosome = "6", genome = "danRer11", 
#   name = "Alignment", max_height = 10, 
#   referenceSequence = SequenceTrack(temp, 
#                                     name = "Ref", chromosome = "6"))
# 
# png("gviz/sample_s4.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_s4), chromosome = "6",
#          from = 43429185, to = 43429186, main = "S4")
# dev.off()
# 
# atrack_s5 <- AlignmentsTrack(
#   "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873957_GSM4578114_fmr1_3_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam", 
#   isPaired = TRUE, chromosome = "6", genome = "danRer11", 
#   name = "Alignment",
#   referenceSequence = SequenceTrack(temp, 
#                                     name = "Ref", chromosome = "6"), 
#   size = 10)
# png("gviz/sample_s5.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_s5), chromosome = "6",
#          from = 43429185, to = 43429186, main = "S5")
# dev.off()
# 
# atrack_s8 <- AlignmentsTrack(
#   "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873958_GSM4578115_fmr1_4_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam", 
#   isPaired = TRUE, chromosome = "6", genome = "danRer11", 
#    name = "Alignment",
#   referenceSequence = SequenceTrack(temp, 
#                                     name = "Ref", chromosome = "6"), 
#   size = 10)
# 
# 
# png("gviz/sample_s8.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_s8), chromosome = "6",
#          from = 43429185, to = 43429186, main = "S8")
# dev.off()
# 
# atrack_A <- AlignmentsTrack(
#   "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873959_GSM4578116_WT_1_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam", 
#   isPaired = TRUE, chromosome = "6", genome = "danRer11", 
# name = "Alignment",
#   referenceSequence = SequenceTrack(temp, 
#                                     name = "Ref", chromosome = "6"), 
#   size = 10)
# 
# png("gviz/sample_A.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_A), chromosome = "6",
#          from = 43429185, to = 43429186, main = "A")
# dev.off()
# 
# atrack_D <- AlignmentsTrack(
#   "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873960_GSM4578117_WT_2_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam", 
#   isPaired = TRUE, chromosome = "6", genome = "danRer11", 
# name = "Alignment",
#   referenceSequence = SequenceTrack(temp, 
#                                     name = "Ref", chromosome = "6"), 
#   size = 10)
# png("gviz/sample_D.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_D), chromosome = "6",
#          from = 43429185, to = 43429186, main = "D")
# dev.off()
# 
# 
# atrack_G <- AlignmentsTrack(
#   "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873961_GSM4578118_WT_3_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam", 
#   isPaired = TRUE, chromosome = "6", genome = "danRer11", 
# name = "Alignment",
#   referenceSequence = SequenceTrack(temp, 
#                                     name = "Ref", chromosome = "6"))
# 
# png("gviz/sample_G.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_G), chromosome = "6",
#          from = 43429185, to = 43429186, main = "G")
# dev.off()
# 
# atrack_L <- AlignmentsTrack(
#   "/fast/users/a1211024/fmr1/03_alignedData/bams/SRR11873962_GSM4578119_WT_4_Danio_rerio_RNA-Seq_Aligned.sortedByCoord.out.bam", 
#   isPaired = TRUE, chromosome = "6", genome = "danRer11", 
# name = "Alignment",
#   referenceSequence = SequenceTrack(temp, 
#                                     name = "Ref", chromosome = "6"))
# 
# png("gviz/sample_L.png", width = 2.5, height = 20, units = "cm", res = 1000)
# plotTracks(list(atrack_L), chromosome = "6",
#           from = 43429185, to = 43429186, main = "L")
# dev.off()

correlation between proportion of nacre reads with principal components

These next few plots are exploring the correlations between the degree of contamination of the nacre allele with the first four principal components (which together, explain ~85% of the total variance in this dataset).

# table showing number of reads aligning to the W2 site according to samtools view
nacre <- tibble(sample = colnames(x), 
       num_reads = c(39, 51, 55, 44, 47, 58, 65, 42), 
       num_nacre_reads = c(1, 4, 16, 0, 0, 16, 0, 3)) %>% 
    mutate(prop_nacre = num_nacre_reads*100/num_reads)

nacre %>% 
    left_join(x$samples) %>% 
    ggplot(aes(sample, prop_nacre)) +
    geom_col(aes(fill = group)) + 
    scale_fill_manual(values = c("grey50", "red")) +
    facet_wrap(~group, scales = "free_x") +
    labs(x = "Sample", y = "Proportion of nacre reads", fill = "Genotype") +
    ggtitle("Summary of proportion of reads aligning to nacre site") 
correlation of the first four principal components with the proportion of reads aligning to the nacre allele.

correlation of the first four principal components with the proportion of reads aligning to the nacre allele.

    #ggsave("gviz/nacreProp.png", width = 10, height = 5, units = "cm", dpi = 600, scale = 1.2)
# PCA on cqn adjusted vals. 
logCPM %>% 
    t() %>% 
    prcomp %>% 
    .$x %>% 
    as.data.frame() %>% 
    rownames_to_column("sample") %>% 
    left_join(nacre %>% dplyr::select(sample, prop_nacre)) %>% 
    left_join(x$samples %>% dplyr::select(sample, c( fmr1_genotype = group), lib.size)) %>% 
    dplyr::select(-c(PC5, PC6, PC7, PC8)) %>% 
    mutate_at(vars(fmr1_genotype), as.integer) %>%
    column_to_rownames("sample") %>%
    cor() %>%
    corrplot::corrplot(
        diag = FALSE, 
        title = "correlation between the first four PCs\nof the cqn adjusted logCPMs,\nfmr1 genotype, library size,\nand proportion of reads aligning to nacre allele",
        type = "upper", 
        addCoef.col = "black"
  )
correlation of principal components with fmr1 genotype and proportion of nacre reads. The proportion of nacre reads has almost as much correlation as *fmr1* genotype with the firsr four PCs.

correlation of principal components with fmr1 genotype and proportion of nacre reads. The proportion of nacre reads has almost as much correlation as fmr1 genotype with the firsr four PCs.

logCPM %>% 
    t() %>% 
    prcomp %>% 
    .$x %>% 
    as.data.frame() %>% 
    rownames_to_column("sample") %>% 
    left_join(nacre %>% dplyr::select(sample, prop_nacre)) %>% 
    left_join(x$samples %>% dplyr::select(sample, c( fmr1_genotype = group), lib.size)) %>% 
    pivot_longer(cols = starts_with("PC"), values_to = "PC_value", names_to = "Component") %>% 
    dplyr::filter(Component %in% c("PC1", "PC2", "PC3", "PC4")) %>% 
    ggplot(aes(x = PC_value, y = prop_nacre)) +
    geom_point(aes(colour = fmr1_genotype)) +
    theme_bw() +
    facet_wrap(~Component, scales = "free_x") +
    geom_label_repel(aes(label = sample)) +
    geom_smooth(method = "lm") +
    scale_color_manual(values = c("grey50", "red")) +
    theme(legend.position = "bottom") +
    labs(x = "Principal component value", 
         y = "Proportion of nacre reads", 
         colour = "fmr1 genotype") +
    ggtitle("PC values vs proportion nacre")
correlation of principal components with fmr1 genotype and proportion of nacre reads. The proportion of nacre reads has almost as much correlation as *fmr1* genotype with the firsr four PCs.

correlation of principal components with fmr1 genotype and proportion of nacre reads. The proportion of nacre reads has almost as much correlation as fmr1 genotype with the firsr four PCs.

    #ggsave("gviz/PCvsNacre.png", width = 7.5, height = 7.5, units = "cm", scale = 1.5, dpi = 600)

Some correlation is observed with the value of the first four principal components with the proportion of reads aligning to the nacre allele. However, it does not appear to be highly significant, due to the standard errors of the regression lines being quite large.

DGE analysis including nacre in the model

The proportion of nacre reads is a random effect. To incoporate the effect of the degree of nacre contaminantion, I will convert the proportion of nacre contamination to a catagorial covariate (as percentages/proportions do not work well in linear models see https://stats.stackexchange.com/questions/103731/what-are-the-issues-with-using-percentage-outcome-in-linear-regression ),

Proportion of nacre alleles will be classified into three bins: - no nacre (0) - low nacre (1 > x < 10) or - high nacre (x > 10).

# add nacre to x$samples
x$samples %<>% 
    left_join(nacre %>% 
                dplyr::select(sample, prop_nacre)) %>% 
    mutate(nacre_factor = case_when(
        prop_nacre == 0 ~ "none", 
        prop_nacre < 10 ~ "high", 
        TRUE ~ "low") %>% 
        as.factor())

x$samples$nacre_factor <- relevel(x$samples$nacre_factor, ref = 'none')

differential expression using limma and a blocking variable

To deal with the random effect, I will use a blocking variable in limma: duplicateCorrelations, http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/dupcor.html

The example I am mostly following is found here. https://www.biostars.org/p/54565/

# perform the voom transformation
voom <- voom(x, design, plot=FALSE)

# estimate the correlation between replicates on the blocking variable (nacre proporttion. 
# this takes a while, as it fits a mixed model for each gene
corfit <- duplicateCorrelation(voom, design, block=x$samples$nacre_factor)

# fit the linear model
fit <- lmFit(voom, design, 
             block=x$samples$nacre_factor, 
             correlation=corfit$consensus.correlation) %>% 
    eBayes()

# call the toptable
topTable_nacreBlock <- topTable(fit, coef="FMR1", n=Inf, adjust.method = "fdr", sort.by = "p") %>% 
    as_tibble() %>% 
    mutate(DE = adj.P.Val < 0.05) 

topTable_nacreBlock %>% 
    head(10) %>% 
    pander(caption = "top 20 most DE genes after using the degree of nacre contamination as a blocking variable")
top 20 most DE genes after using the degree of nacre contamination as a blocking variable (continued below)
ensembl_gene_id location
ENSDARG00000037433 14:20156477-20191281:+
ENSDARG00000041257 14:17108003-17121713:-
ENSDARG00000052609 14:9056071-9066756:+
ENSDARG00000037097 14:30367602-30390145:-
ENSDARG00000104007 10:21650828-21820057:+
ENSDARG00000102435 7:45975537-45976956:+
ENSDARG00000036155 14:38932441-38946808:-
ENSDARG00000101393 6:18364048-18366339:-
ENSDARG00000104491 22:1205787-1245226:-
ENSDARG00000061585 14:29975268-29990757:+
Table continues below
description external_gene_name logFC
fragile X mental retardation 1 [Source:ZFIN;Acc:ZDB-GENE-020731-6] fmr1 -1.195
smoothelin-like 1 [Source:ZFIN;Acc:ZDB-GENE-050306-23] smtnl1 2.009
uncharacterized LOC100005318 [Source:NCBI gene;Acc:100005318] CU468164.1 5.608
solute carrier family 7 (cationic amino acid transporter, y+ system), member 2 [Source:ZFIN;Acc:ZDB-GENE-041212-5] slc7a2 1.097
protocadherin 1 gamma 1 [Source:ZFIN;Acc:ZDB-GENE-050608-1] pcdh1g1 1.821
pleckstrin homology domain containing, family F (with FYVE domain) member 1 [Source:ZFIN;Acc:ZDB-GENE-040426-1289] plekhf1 1.381
galactosidase, alpha [Source:ZFIN;Acc:ZDB-GENE-041010-207] gla 1.699
si:dkey-31g6.6 [Source:ZFIN;Acc:ZDB-GENE-110209-1] si:dkey-31g6.6 -0.9758
adenosine monophosphate deaminase 2a [Source:ZFIN;Acc:ZDB-GENE-091204-4] ampd2a 0.7346
cytochrome P450, family 4, subfamily V, polypeptide 7 [Source:ZFIN;Acc:ZDB-GENE-061103-88] cyp4v7 1.476
AveExpr t P.Value adj.P.Val B DE
5.029 -12.08 3.088e-07 0.005646 4.508 TRUE
2.962 9.529 2.706e-06 0.02474 1.716 TRUE
-1.801 8.892 5.024e-06 0.03061 -2.048 TRUE
3.746 6.134 0.0001165 0.358 0.6948 FALSE
0.8814 6.07 0.0001268 0.358 -1.244 FALSE
3.081 6.056 0.0001291 0.358 0.2514 FALSE
1.813 6.01 0.0001371 0.358 -0.6305 FALSE
5.529 -5.652 0.0002218 0.4224 0.6961 FALSE
2.666 5.574 0.000247 0.4224 -0.2985 FALSE
1.103 5.501 0.0002731 0.4224 -1.327 FALSE

I also want to see the limma analysis without the blocking variable to determine whether the results change.

DE analysis without blocking

# fit the linear model
fit_noBlock <- lmFit(voom, design) %>% 
    eBayes()

# call the toptable
topTable_nonacreBlock <- topTable(fit_noBlock, coef="FMR1", n=Inf, adjust.method = "fdr", sort.by = "p") %>% 
    as_tibble() %>% 
    mutate(DE = adj.P.Val < 0.05) 

topTable_nonacreBlock %>% 
    head(10) %>% 
    pander(caption = "top 20 most DE genes after using the degree of nacre contamination as a blocking variable")
top 20 most DE genes after using the degree of nacre contamination as a blocking variable (continued below)
ensembl_gene_id location
ENSDARG00000037433 14:20156477-20191281:+
ENSDARG00000041257 14:17108003-17121713:-
ENSDARG00000052609 14:9056071-9066756:+
ENSDARG00000037097 14:30367602-30390145:-
ENSDARG00000102435 7:45975537-45976956:+
ENSDARG00000104007 10:21650828-21820057:+
ENSDARG00000036155 14:38932441-38946808:-
ENSDARG00000104491 22:1205787-1245226:-
ENSDARG00000101393 6:18364048-18366339:-
ENSDARG00000104353 6:9664206-9695294:-
Table continues below
description external_gene_name logFC
fragile X mental retardation 1 [Source:ZFIN;Acc:ZDB-GENE-020731-6] fmr1 -1.193
smoothelin-like 1 [Source:ZFIN;Acc:ZDB-GENE-050306-23] smtnl1 2.027
uncharacterized LOC100005318 [Source:NCBI gene;Acc:100005318] CU468164.1 5.588
solute carrier family 7 (cationic amino acid transporter, y+ system), member 2 [Source:ZFIN;Acc:ZDB-GENE-041212-5] slc7a2 1.092
pleckstrin homology domain containing, family F (with FYVE domain) member 1 [Source:ZFIN;Acc:ZDB-GENE-040426-1289] plekhf1 1.389
protocadherin 1 gamma 1 [Source:ZFIN;Acc:ZDB-GENE-050608-1] pcdh1g1 1.8
galactosidase, alpha [Source:ZFIN;Acc:ZDB-GENE-041010-207] gla 1.7
adenosine monophosphate deaminase 2a [Source:ZFIN;Acc:ZDB-GENE-091204-4] ampd2a 0.7355
si:dkey-31g6.6 [Source:ZFIN;Acc:ZDB-GENE-110209-1] si:dkey-31g6.6 -0.9592
NOP58 ribonucleoprotein homolog (yeast) [Source:ZFIN;Acc:ZDB-GENE-040426-2140] nop58 -0.5463
AveExpr t P.Value adj.P.Val B DE
5.029 -11.7 4.143e-07 0.007573 3.307 TRUE
2.962 9.358 3.175e-06 0.02902 0.6249 TRUE
-1.801 8.578 6.877e-06 0.0419 -2.77 TRUE
3.746 5.938 0.0001506 0.4632 -0.001555 FALSE
3.081 5.926 0.000153 0.4632 -0.4563 FALSE
0.8814 5.827 0.0001746 0.4632 -2.023 FALSE
1.813 5.815 0.0001774 0.4632 -1.417 FALSE
2.666 5.421 0.0003049 0.4879 -0.981 FALSE
5.529 -5.398 0.0003148 0.4879 0.1236 FALSE
6.684 -5.327 0.0003482 0.4879 0.07983 FALSE

Since the results do not overly change when including the cacre blocking variable, this provides evidence that the nacre contamination does not overly change the differental expression tests.
A comparison of the ranking statistic (sign(logFC) x -log10(p-value)) with and without blocking did not overly change the most significant genes.

topTable_nonacreBlock %>% 
    mutate(rankstat_noBlock = sign(logFC)*-log10(P.Value)) %>% 
    dplyr::select(external_gene_name, rankstat_noBlock) %>% 
    left_join(topTable_nonacreBlock %>% 
                mutate(rankstat_Block = sign(logFC)*-log10(P.Value)) %>% 
                dplyr::select(external_gene_name, rankstat_Block)) %>% 
    ggplot(aes(rankstat_noBlock, rankstat_Block )) +
    geom_point(alpha = 0.5) +
    labs(x = "-log10(p-value) x sign(logFC) no blocking", 
         y = "-log10(p-value) x sign(logFC) with blocking") 

All my other DE analysis was performed with edgeR, so I wil try performing the blocking with that as well (following section 3.4.2 in the edgeRUsersGuide()).

DGE (edgeR) with blocking variable of nacre

design_nacre <- model.matrix(~nacre_factor + group, data = x$samples)

glmCQN_nacre <- glmFit(x, design_nacre) 

topTableCQN_nacre <- glmCQN_nacre %>% 
    glmLRT(coef = "groupFMR1") %>%
    topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>% 
    .$table %>% 
    as_tibble() %>% 
    mutate(DE = FDR < 0.05)

glmCQN_nacre %>% 
    glmLRT(coef = "groupFMR1") %>%
    topTags(n = Inf, adjust.method = "fdr", sort.by = "p") %>% 
    .$table %>% 
    as_tibble() %>% 
    mutate(DE = FDR < 0.05)
## # A tibble: 18,280 x 10
##    ensembl_gene_id location description external_gene_n… logFC logCPM    LR
##    <chr>           <chr>    <chr>       <chr>            <dbl>  <dbl> <dbl>
##  1 ENSDARG0000011… 7:7653-… NULL        FO704772.1       -2.25   4.45  29.0
##  2 ENSDARG0000004… 14:1710… smoothelin… smtnl1            2.59   3.32  28.8
##  3 ENSDARG0000003… 14:2015… fragile X … fmr1             -1.06   5.15  17.0
##  4 ENSDARG0000011… 6:28203… zinc finge… CZQB01141835.1   -1.02   4.81  14.2
##  5 ENSDARG0000011… 7:7658-… NULL        FO704772.2       -1.84   4.07  13.2
##  6 ENSDARG0000009… 20:1412… liver-enri… leg1.1            1.96   6.62  12.1
##  7 ENSDARG0000009… 20:1398… liver-enri… leg1.2            1.73   2.70  11.1
##  8 ENSDARG0000008… 22:1919… F-box prot… fbxo42           -1.35   3.20  10.9
##  9 ENSDARG0000002… 16:2644… erythrocyt… epb41b            2.00   4.02  10.4
## 10 ENSDARG0000007… 14:4559… zgc:194285… zgc:194285        2.49   2.11  10.3
## # … with 18,270 more rows, and 3 more variables: PValue <dbl>, FDR <dbl>,
## #   DE <lgl>

When including the level of nacre contamination in the model, only 2 genes were found to be DE due to fmr1 genotype. However, this method does not model the level of nacre contamination as a random effect, so this may not be completely correct.

In summary, the contamination of nacre allele probably doesnt overall effect the transcriptome too much.

BLAST analysis for genetic compensation

This part was taken from Stephen and Melanie’s FMR1_Analysis.Rmd file:

For each of the 3 DE genes, a brief analysis was performed using the Ensembl BLAST server to search all cDNAs within the zebrafish transcriptome, with parameters set for distant homology. For each gene the longest transcript was chosen as the query sequence. Only results with an E-value < 1 were retained.

Note that the 3 DE genes were found from the analysis using limma. These genes were also detected using edgeR.

summarise() ungrouping output (override with .groups argument)

Summary of BLAST matches for fmr1
Gene hit description Total Matches Longest Match
fxr1 fragile X mental retardation, autosomal homolog 1 4 1359
fxr2 fragile X mental retardation, autosomal homolog 2 8 977
anks1b ankyrin repeat and sterile alpha motif domain containing 1B 1 88
pimr138 Pim proto-oncogene, serine/threonine kinase, related 138 1 87
si:dkey-90l23.1 si:dkey-90l23.1 1 84
eno3 enolase 3, (beta, muscle) 1 80
nr2f6a nuclear receptor subfamily 2, group F, member 6a 1 75
ZBTB26 zinc finger and BTB domain containing 26 1 72
nat10 N-acetyltransferase 10 1 46
rybpa RING1 and YY1 binding protein a 1 43
ikzf5 IKAROS family zinc finger 5 4 38
ctdnep1b CTD nuclear envelope phosphatase 1b 1 38
gnb1b guanine nucleotide binding protein (G protein), beta polypeptide 1b 1 35
acvr2aa activin A receptor type 2Aa 2 34
ube2d2 ubiquitin-conjugating enzyme E2D 2 (UBC4/5 homolog, yeast) 2 31
traf3ip2l TRAF3 interacting protein 2-like 1 31

Below is the volcano plot from edgeR with the blast hits for fmr1 highlighted.

topTableCQN %>%
    mutate(`Blast Hit` = external_gene_name %in% fmr1Blast$`Gene hit`) %>% 
    arrange(`Blast Hit`) %>%
    ggplot(aes(logFC, -log10(PValue), label = external_gene_name, colour = `Blast Hit`)) +
    geom_point(alpha = 0.4) +
    geom_label_repel(data = . %>% dplyr::filter(`Blast Hit`),
                    show.legend = FALSE, 
                    size = 3) +
    scale_colour_manual(values = c("grey30", "blue")) +
    theme_bw() +
    theme(aspect.ratio = 1) 

    #ggsave("blastRe_edgeR.png")

Enrichment testing using the blast hits

To more formally test for transcriptional adaptation, I will perform an enrichment analysis, treating the blast hit genes as a gene set. Using fry with a directional hypothesis, this did not reveal any significant upregulation as a group.

blasthits <- 
    read_csv(here::here("results/BLAST_fmr1s-txome.csv")) %>% 
    left_join(as.data.frame(genes), by = c(`Gene hit` = "symbol")) %>% 
    dplyr::select( "Gene hit"  , "gene_id"  ) %>% 
    na.omit() %>% 
    .$gene_id %>% 
    unique

logCPM %>%
    fry(
        index = blasthits,
        design = design, 
        contrast = "FMR1", 
        sort = "directional"
    ) %>% 
    rownames_to_column("geneset") %>% 
    as_tibble() %>% 
    dplyr::select(geneset, NGenes, PValue) %>%
    pander(
        caption = paste(
            "Enrichment analysis using fry on the blast hits. No significant upregulation is observed.",
            "Tests were directional."
        ),
        split.tables = Inf,
        justify = "lrr"
    )
Enrichment analysis using fry on the blast hits. No significant upregulation is observed. Tests were directional.
geneset NGenes PValue
set1 48 0.6288
logCPM %>% 
    as.data.frame() %>% 
    .[blasthits,] %>% 
    na.omit %>% 
    rownames_to_column("gene_id") %>% 
    left_join(genes %>% as_tibble) %>% 
    dplyr::select(colnames(x), gene_name) %>% 
    dplyr::distinct(gene_name, .keep_all = TRUE) %>% 
    column_to_rownames("gene_name") %>% 
    t() %>% 
    pheatmap(color = viridis_pal(option = "viridis")(100), cellwidth = 10, cellheight = 10, 
             main = "CQN-adjusted expression of blast hits")

BLAST Analysis (Other genes)

From FMR1_Analysis.Rmd:

Although not the subject of investigation, the genes detected as DE were analysed using BLAST as a reverse viewpoint on the experiment. None of the genes contained matches to fmr1.

summarise() ungrouping output (override with .groups argument)

Summary of BLAST matches for smtnl1
Gene hit description Total Matches Longest Match
smtna smoothelin a 1 297
smtnb smoothelin b 4 258
si:ch211-195o20.7 si:ch211-195o20.7 1 173
CU929178.1 NUL 1 134
nefla neurofilament, light polypeptide a 3 132
psd2 pleckstrin and Sec7 domain containing 2 1 92
si:dkey-20i20.10 si:dkey-20i20.10 1 88
taf3 TAF3 RNA polymerase II, TATA box binding protein (TBP)-associated facto 2 84
tgfbr2a transforming growth factor beta receptor 2a 1 78
ppp1r37 protein phosphatase 1, regulatory subunit 37 1 76
brd3b bromodomain containing 3b 3 63
cmya5 cardiomyopathy associated 5 2 61
chd5 chromodomain helicase DNA binding protein 5 1 60
chd2 chromodomain helicase DNA binding protein 2 2 58
zgc:175284 zgc:175284 2 54
zgc:173726 zgc:173726 6 52
map6a microtubule-associated protein 6a 2 52
si:dkey-20i20.9 si:dkey-20i20.9 1 52
znf1170 zinc finger protein 1170 1 52
stab2 stabilin 2 4 51
znf1166 zinc finger protein 1166 3 51
igfn1.1 immunoglobulin-like and fibronectin type III domain containing 1, tandem duplicate 1 1 49
zswim8 zinc finger, SWIM-type containing 8 4 47
kcnma1a potassium large conductance calcium-activated channel, subfamily M, alpha member 1a 3 44
gpbp1l1 GC-rich promoter binding protein 1-like 1 2 44
foxn2b forkhead box N2b 1 43
polr2f polymerase (RNA) II (DNA directed) polypeptide F 1 42
lmod1a leiomodin 1a (smooth muscle) 1 41
nktr natural killer cell triggering receptor 1 41
im:7152348 im:7152348 1 39
jak2b Janus kinase 2b 1 35
nop58 NOP58 ribonucleoprotein homolog (yeast) 1 35
zfr2 zinc finger RNA binding protein 2 2 29
chd3 chromodomain helicase DNA binding protein 3 3 23
Volcano plot showing closest genes from the BLAST search on *smtnl1*

Volcano plot showing closest genes from the BLAST search on smtnl1

One gene (nop58) with a BLAST match to smtnl narrowly missed the threshold for being called DE. However, the match was only 35bp.

summarise() ungrouping output (override with .groups argument)

Summary of BLAST matches for CU468164.1
Gene hit description Total Matches Longest Match
si:ch211-262h13.3 si:ch211-262h13.3 5 268
si:ch211-244k5.1 si:ch211-244k5.1 1 267
si:dkey-242h9.3 si:dkey-242h9.3 1 70
or117-1 odorant receptor, family F, subfamily 117, member 1 1 53
cib1 calcium and integrin binding 1 (calmyrin) 2 39
im:7147486 im:7147486 2 35
ppp1r18 protein phosphatase 1, regulatory subunit 18 1 35
Volcano plot showing closest genes from the BLAST search on *smtnl1*

Volcano plot showing closest genes from the BLAST search on smtnl1

Conclusion

From FMR1_Analysis.Rmd: At this stage, no clear evidence can be found for genetic compensation in this dataset. Alternative strategies were briefly investigated using pair-wise alignments between all DE genes, but no compelling results were found. As no specific evidence exists for 3’UTR, 5’UTR or other non-coding regions in compensation, no analysis was conducted on these elements.

Session info

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
##  [1] grid      stats4    splines   parallel  stats     graphics  grDevices
##  [8] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ensembldb_2.12.1                    AnnotationFilter_1.12.0            
##  [3] GenomicFeatures_1.40.1              AnnotationDbi_1.50.3               
##  [5] Biobase_2.48.0                      BSgenome.Drerio.UCSC.danRer11_1.4.2
##  [7] BSgenome_1.56.0                     rtracklayer_1.48.0                 
##  [9] Biostrings_2.56.0                   XVector_0.28.0                     
## [11] Gviz_1.32.0                         GenomicRanges_1.40.0               
## [13] GenomeInfoDb_1.24.2                 IRanges_2.22.2                     
## [15] S4Vectors_0.26.1                    biomaRt_2.44.4                     
## [17] pheatmap_1.0.12                     viridis_0.5.1                      
## [19] viridisLite_0.3.0                   ggpubr_0.4.0                       
## [21] pander_0.6.3                        pathview_1.28.1                    
## [23] cowplot_1.1.0                       ggfortify_0.4.11                   
## [25] cqn_1.34.0                          quantreg_5.75                      
## [27] SparseM_1.78                        preprocessCore_1.50.0              
## [29] nor1mix_1.3-0                       mclust_5.4.7                       
## [31] ggeasy_0.1.2                        ggrepel_0.8.2                      
## [33] DT_0.16                             scales_1.1.1                       
## [35] goseq_1.40.0                        geneLenDataBase_1.24.0             
## [37] BiasedUrn_1.07                      msigdbr_7.2.1                      
## [39] fgsea_1.14.0                        AnnotationHub_2.20.2               
## [41] BiocFileCache_1.12.1                dbplyr_2.0.0                       
## [43] BiocGenerics_0.34.0                 edgeR_3.30.3                       
## [45] limma_3.44.3                        forcats_0.5.0                      
## [47] stringr_1.4.0                       dplyr_1.0.2                        
## [49] purrr_0.3.4                         readr_1.4.0                        
## [51] tidyr_1.1.2                         tibble_3.0.4                       
## [53] ggplot2_3.3.2                       tidyverse_1.3.0                    
## [55] magrittr_2.0.1                     
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.1.4                    tidyselect_1.1.0             
##   [3] RSQLite_2.2.1                 htmlwidgets_1.5.2            
##   [5] BiocParallel_1.22.0           munsell_0.5.0                
##   [7] statmod_1.4.35                withr_2.3.0                  
##   [9] colorspace_2.0-0              highr_0.8                    
##  [11] knitr_1.30                    rstudioapi_0.13              
##  [13] ggsignif_0.6.0                labeling_0.4.2               
##  [15] KEGGgraph_1.48.0              GenomeInfoDbData_1.2.3       
##  [17] farver_2.0.3                  bit64_4.0.5                  
##  [19] rprojroot_2.0.2               vctrs_0.3.5                  
##  [21] generics_0.1.0                xfun_0.19                    
##  [23] biovizBase_1.36.0             R6_2.5.0                     
##  [25] locfit_1.5-9.4                bitops_1.0-6                 
##  [27] DelayedArray_0.14.1           assertthat_0.2.1             
##  [29] promises_1.1.1                nnet_7.3-14                  
##  [31] gtable_0.3.0                  conquer_1.0.2                
##  [33] rlang_0.4.9                   MatrixModels_0.4-1           
##  [35] lazyeval_0.2.2                rstatix_0.6.0                
##  [37] dichromat_2.0-0               checkmate_2.0.0              
##  [39] broom_0.7.2                   BiocManager_1.30.10          
##  [41] yaml_2.2.1                    abind_1.4-5                  
##  [43] modelr_0.1.8                  crosstalk_1.1.0.1            
##  [45] backports_1.2.0               httpuv_1.5.4                 
##  [47] Hmisc_4.4-1                   tools_4.0.2                  
##  [49] ellipsis_0.3.1                RColorBrewer_1.1-2           
##  [51] Rcpp_1.0.5                    base64enc_0.1-3              
##  [53] progress_1.2.2                zlibbioc_1.34.0              
##  [55] RCurl_1.98-1.2                prettyunits_1.1.1            
##  [57] rpart_4.1-15                  openssl_1.4.3                
##  [59] cluster_2.1.0                 SummarizedExperiment_1.18.2  
##  [61] haven_2.3.1                   here_1.0.0                   
##  [63] fs_1.5.0                      data.table_1.13.2            
##  [65] openxlsx_4.2.3                reprex_0.3.0                 
##  [67] ProtGenerics_1.20.0           matrixStats_0.57.0           
##  [69] hms_0.5.3                     mime_0.9                     
##  [71] evaluate_0.14                 xtable_1.8-4                 
##  [73] XML_3.99-0.5                  rio_0.5.16                   
##  [75] jpeg_0.1-8.1                  readxl_1.3.1                 
##  [77] gridExtra_2.3                 compiler_4.0.2               
##  [79] crayon_1.3.4                  htmltools_0.5.0              
##  [81] mgcv_1.8-33                   later_1.1.0.1                
##  [83] Formula_1.2-4                 lubridate_1.7.9.2            
##  [85] DBI_1.1.0                     corrplot_0.84                
##  [87] rappdirs_0.3.1                Matrix_1.2-18                
##  [89] car_3.0-10                    cli_2.2.0                    
##  [91] pkgconfig_2.0.3               GenomicAlignments_1.24.0     
##  [93] foreign_0.8-80                xml2_1.3.2                   
##  [95] rvest_0.3.6                   VariantAnnotation_1.34.0     
##  [97] digest_0.6.27                 graph_1.66.0                 
##  [99] rmarkdown_2.5                 cellranger_1.1.0             
## [101] fastmatch_1.1-0               htmlTable_2.1.0              
## [103] curl_4.3                      shiny_1.5.0                  
## [105] Rsamtools_2.4.0               lifecycle_0.2.0              
## [107] nlme_3.1-150                  jsonlite_1.7.1               
## [109] carData_3.0-4                 askpass_1.1                  
## [111] fansi_0.4.1                   pillar_1.4.7                 
## [113] lattice_0.20-41               KEGGREST_1.28.0              
## [115] fastmap_1.0.1                 httr_1.4.2                   
## [117] survival_3.2-7                GO.db_3.11.4                 
## [119] interactiveDisplayBase_1.26.3 glue_1.4.2                   
## [121] zip_2.1.1                     png_0.1-7                    
## [123] BiocVersion_3.11.1            bit_4.0.4                    
## [125] Rgraphviz_2.32.0              stringi_1.5.3                
## [127] blob_1.2.1                    org.Hs.eg.db_3.11.4          
## [129] latticeExtra_0.6-29           memoise_1.1.0